Machine Learning for Genomic Sequence Analysis - Dissertation -

Authors

  • Sören Sonnenburg
  • Thomas Magedanz
Abstract

Acknowledgements

Above all, I would like to thank Dr. Gunnar Rätsch and Prof. Dr. Klaus-Robert Müller for their guidance and inexhaustible support, without which writing this thesis would not have been possible. All of the work in this thesis was done at the Fraunhofer Institute FIRST in Berlin and the Friedrich Miescher Laboratory in Tübingen. I very much enjoyed the inspiring atmosphere in the IDA group headed by K.-R. Müller and in G. Rätsch's group at the FML. Much of what we have achieved was only possible in a joint effort. As the various fruitful within-group collaborations show, we are a team. […] reading the draft, helpful discussions and moral support. I acknowledge the support from all members of the IDA group at TU Berlin and Fraunhofer FIRST and the members of the Machine Learning in Computational Biology group at the Friedrich Miescher Laboratory, especially for their tolerance in letting me run the large number of compute jobs that were required to perform the experiments. Finally, I would like to thank the system administrators […]

[…] became utterly excited about this subject. During that seminar, I gave an introductory talk on SVMs to an all-knowing audience, but it took a while before I seriously picked up the SVM subject in 2002. For my student research project, Gunnar Rätsch suggested Hidden Markov Models (HMMs) and their application to bioinformatics. The idea was to find genes, but as this problem is too complex to start with, I began with the sub-problem of recognising splice sites. HMMs using a certain predefined topology performed quite well. However, we managed to improve on this using string kernels (derived from HMMs) and Support Vector Machines (Sonnenburg et al., 2002; Tsuda et al., 2002b). Unfortunately, training SVMs with these string kernels was computationally very costly. Even though we had a sample of more than 100,000 data points, we could barely afford training on 10,000 instances on what was, at that time, a cutting-edge Compaq Alpha compute cluster. In contrast to HMMs, the resulting trained SVM classifiers were not easily accessible. After a research break of about 2 years, during which I became the group's system administrator, I started work on exactly these topics, which were still valid. All of this research was application driven: improving the detection of the splicing signal (and later, other signals) in order to construct a gene finder that is applicable on a genomic scale. To this end, I …

Related resources

Improving the Performance and Precision of Bioinformatics Algorithms

Title of dissertation: Improving the Performance and Precision of Bioinformatics Algorithms. Xue Wu, Doctor of Philosophy, 2008. Dissertation directed by: Professor Chau-Wen Tseng, Department of Computer Science. Recent advances in biotechnology have enabled scientists to generate and collect huge amounts of biological experimental data. Software tools for analyzing both genomic (DNA) and proteomic...

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation...

Two meta-heuristic algorithms for parallel machines scheduling problem with past-sequence-dependent setup times and effects of deterioration and learning

This paper considers the identical parallel machines scheduling problem with past-sequence-dependent setup times, deteriorating jobs and learning effects, in which the actual processing time of a job on each machine is given as a function of the processing times of the jobs already processed and its scheduled position on the corresponding machine. In addition, the setup time of a job on each machin...

Combining Multi-Species Genomic Data for MicroRNA Identification Using a Naïve Bayes Classifier Machine Learning for Identification of MicroRNA Genes

Motivation: Numerous computational methodologies utilize techniques based on sequence conservation and/or structural similarity for microRNA gene prediction. In this study we describe a new technique, which is applicable across several species, for predicting microRNA genes. This technique is based on machine learning, using the Naïve Bayes classifier. This computational procedure automatically...

Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches

The DNA sequence, which contains all genetic traits, is not itself a functional entity. Instead, it is converted into protein sequences through the processes of transcription and translation. The protein sequence later takes on a 3D structure, which is the functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can think of is undertaken by proteins with specific functio...

Lot Streaming in No-wait Multi Product Flowshop Considering Sequence Dependent Setup Times and Position Based Learning Factors

This paper considers a no-wait multi-product flowshop scheduling problem with sequence-dependent setup times. Lot streaming divides the lots of products into portions called sublots in order to reduce lead times and work-in-process and to increase machine utilization rates. The objective is to minimize the makespan. To clarify the system, a mathematical model of the problem is presented. Sin...

Publication date: 2009